Research and optimization of the Bloom filter algorithm in Hadoop

Author

  • Bing Dong
Abstract

An increasing number of enterprises need to transfer data from a traditional database to a cloud-computing system. Big data in Teradata (a data warehouse) often needs to be transferred to Hadoop, a distributed system, for further computing and analysis. However, if the data stored in Teradata is not synced with Hadoop, e.g. because data is lost during communication, synchronization, or copying, the two data sets become inconsistent. A survey shows that, besides the algorithm provided by Hadoop, the Bloom filter algorithm can be a good choice for data reconciliation. MD5 hashing is applied to reduce the amount of data transmitted. In the experiments, data from both sides was compared using a Bloom filter; if any data was lost during the process, the differing primary keys could be found. The result can be used to track changes to the original data. During this thesis project, an experimental system using the MapReduce mode of Hadoop was implemented. The implementation used real data, and its parameters were adjustable so that different schemes (Basic join, CBF, SBF and IBF) could be analyzed.

The thesis first introduces the basic knowledge and key technology of the Bloom filter algorithm. It then systematically reviews the existing Bloom filter algorithms and the pros and cons of each, and introduces the principle of the MapReduce program in Hadoop. Next, three schemes, all in accordance with the requirements, are introduced in detail. The fourth part covers the implementation of the schemes in Hadoop as well as the design and implementation of the testing system. The fifth part tests and analyzes each scheme, assessing feasibility with respect to performance and cost using experimental data. Finally, conclusions and ideas for further improvement of the Bloom filter are presented.
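The reconciliation idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the thesis implementation, which ran as MapReduce jobs on Hadoop. The class `BloomFilter`, the helpers `row_fingerprint` and `find_unsynced_keys`, the sizing formulas, and the use of one MD5 digest to derive all bit positions are assumptions made for illustration; only the use of MD5 and of a Bloom filter to find differing primary keys comes from the abstract.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter (hypothetical helper, not the thesis code)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n = max(1, expected_items)
        self.num_bits = max(
            8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Assumption: derive k positions from one MD5 digest via double hashing.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def row_fingerprint(primary_key, row_values):
    """Primary key plus an MD5 digest of the row, so only a short string is compared."""
    payload = "\x1f".join(str(value) for value in row_values)
    return f"{primary_key}:{hashlib.md5(payload.encode('utf-8')).hexdigest()}"


def find_unsynced_keys(source_rows, target_rows):
    """Return source primary keys whose fingerprint is definitely absent from the target."""
    bloom = BloomFilter(expected_items=len(target_rows))
    for primary_key, values in target_rows.items():
        bloom.add(row_fingerprint(primary_key, values))
    return sorted(
        primary_key
        for primary_key, values in source_rows.items()
        if row_fingerprint(primary_key, values) not in bloom
    )
```

Because a Bloom filter has no false negatives, every key this sketch reports is genuinely missing or changed on the target side. A missing key can still go unreported with a probability near the configured false-positive rate, which is the trade-off made for the reduced data transfer.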

Similar resources

A Cuckoo Filter Modification Inspired by Bloom Filter

Probabilistic data structures are popular in membership queries, network applications, and so on. Bloom Filter and Cuckoo Filter are two popular space-efficient models that are incorporated in the set membership checking part of many important protocols. They are compact representations of data that use hash functions to randomize a set of items. Being able to store more elements while keeping a reaso...

Data Optimization Techniques using Bloom Filter in Big Data

Due to the advent of new technologies, devices, and communication means like social networking sites, the amount of data produced by mankind is growing rapidly every year. Traditional computing techniques are not enough to process such a large amount of data. Hadoop is a collection of technologies with the capacity to store large amounts of data on data nodes. Hadoop uses the MapReduce algorithm to proces...

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

Solving the Unconstrained Optimization Problems Using the Combination of Nonmonotone Trust Region Algorithm and Filter Technique

In this paper, we propose a new nonmonotone adaptive trust region method for solving unconstrained optimization problems that is equipped with the filter technique. In the proposed method, various nonmonotone techniques are used. Using these techniques, the algorithm can benefit from nonmonotone properties and increase the rate of solving the problems. Also, the filter that is used in...

Design of IIR Digital Filter using Modified Chaotic Orthogonal Imperialist Competitive Algorithm (RESEARCH NOTE)

There are two types of digital filters: Infinite Impulse Response (IIR) and Finite Impulse Response (FIR). IIR filters attract more attention as they can decrease the filter order significantly compared to FIR filters. Owing to the multi-modal error surface, simple powerful optimization techniques should be utilized in designing IIR digital filters to avoid local minima. Imperialist compe...

Publication date: 2013